Multidimensional Stochastic Approximation
Using Locally Contractive Functions 

1. Summary. A Robbins-Monro type multidimensional stochastic approximation
algorithm which converges in mean square and with probability one to the fixed 
point of a locally contractive regression function is developed. The algorithm 
is applied to obtain maximum likelihood estimates of the parameters for a mixture 
of multivariate normal distributions. 

2. Introduction. Let $E_k$ be real k-dimensional Euclidean space with inner
product denoted by $\langle \cdot , \cdot \rangle$ and norm denoted by $\|\cdot\|$. Corresponding to every
positive definite real $k \times k$ matrix $B$ we define the B-inner product
$\langle x, y\rangle_B = \langle x, By\rangle$ and the B-norm $\|x\|_B = \langle x, Bx\rangle^{1/2}$, for $x, y \in E_k$. A
function $F : D \to E_k$, where $D$ is an open subset of $E_k$, is locally contractive
at a point $\theta^0 \in D$ if there exists a B-norm on $E_k$ and a number $\lambda$,
$0 < \lambda < 1$, such that

(2.1) $\|\theta^0 - F(\theta)\|_B \le \lambda \|\theta^0 - \theta\|_B$

whenever $\theta$ is sufficiently near $\theta^0$. If the above inequality holds for every
$\theta$ in some neighborhood $W$ of $\theta^0$, we say $F$ is $\lambda$-locally contractive at
$\theta^0$ throughout $W$. Clearly, $\theta^0$ will be a unique fixed point in $W$ for $F$.

For any $k \times k$ matrix $A$, let the spectral radius of $A$ be denoted by
$\rho(A) = \sup\{|\lambda| : \lambda \text{ is an eigenvalue of } A\}$. The Frechet derivative of $F$, if
it exists, will be denoted by $\nabla F$. The following result, a consequence of
Taylor's theorem and the theory in [2; section 2.3], will be used in part
4 of this paper.

(2.2) Lemma. If $\nabla F$ exists and is continuous in a neighborhood of $\theta^0$, a
necessary and sufficient condition for $F$ to be locally contractive at $\theta^0$
is that $F(\theta^0) = \theta^0$ and $\rho(\nabla F(\theta^0)) < 1$.
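The role of the B-norm in (2.2) can be illustrated numerically. The following is a minimal Python sketch, not part of the original argument: the matrix `A` stands in for a hypothetical derivative $\nabla F(\theta^0)$ with spectral radius 0.5 that is not a contraction in the Euclidean norm, and the helper `contracting_matrix` (a name introduced here) builds one possible $B$ from the truncated series $\sum_j ((A/\lambda)^j)^T (A/\lambda)^j$, a standard construction assumed for illustration with $\rho(A) < \lambda < 1$.

```python
import math
import random

def matmul(X, Y):
    """Product of two square matrices given as lists of rows."""
    k = len(X)
    return [[sum(X[i][l] * Y[l][j] for l in range(k)) for j in range(k)]
            for i in range(k)]

def transpose(X):
    return [list(col) for col in zip(*X)]

def matvec(X, v):
    return [sum(xij * vj for xij, vj in zip(row, v)) for row in X]

def b_norm(v, B):
    """B-norm ||v||_B = <v, Bv>^(1/2)."""
    return math.sqrt(sum(vi * wi for vi, wi in zip(v, matvec(B, v))))

def contracting_matrix(A, lam, terms=200):
    """Truncated series B = sum_j ((A/lam)^j)^T (A/lam)^j, assuming rho(A) < lam."""
    k = len(A)
    scaled = [[a / lam for a in row] for row in A]
    power = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    B = [[0.0] * k for _ in range(k)]
    for _ in range(terms):
        term = matmul(transpose(power), power)
        B = [[B[i][j] + term[i][j] for j in range(k)] for i in range(k)]
        power = matmul(power, scaled)
    return B

# Hypothetical derivative: rho(A) = 0.5, but A stretches some Euclidean vectors.
A = [[0.5, 2.0], [0.0, 0.5]]
lam = 0.8
B = contracting_matrix(A, lam)

# Largest observed ratio ||Av||_B / ||v||_B over random directions.
rng = random.Random(1)
worst_ratio = 0.0
for _ in range(1000):
    v = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    worst_ratio = max(worst_ratio, b_norm(matvec(A, v), B) / b_norm(v, B))
```

Under these assumptions `worst_ratio` stays below `lam`, even though `A` maps the unit vector (0, 1) to a vector of Euclidean length greater than one.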

Let $\{Y(\theta) : \theta \in D\}$ be a family of random variables with values in $E_k$
satisfying the following conditions:

(2.3) $\sup_{\theta \in D} E(\|Y(\theta)\|^2) < \infty$ ($E$ denotes conditional expectation with $\theta$ fixed)

(2.4) the regression function of $\{Y(\theta)\}$, denoted by $M(\theta) = E(Y(\theta))$, can
be expressed as $M(\theta) = \theta - F(\theta)$ where $F(\theta)$ is locally contractive
at $\theta^0 \in D$.

In part 3 of this paper we develop an algorithm which, given the conditions
above and given a sufficiently close approximation to $\theta^0$, yields a sequence
of recursively defined random variables with values in $E_k$ which converges to
$\theta^0$ in mean square and with probability one.

In part 4 of this paper, this algorithm is used to formulate a stochastic
analog of the iterative procedure developed by Peters and Walker in [3] for
obtaining maximum likelihood estimates of the parameters for a mixture of
normal distributions.

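3. Derivation of the Algorithm

(3.1) Lemma. Let $\rho_1 = (\rho(B^{-1}))^{-1}$ and $\rho_2 = \rho(B)$ where $B$ is a positive
definite $k \times k$ matrix, and let $r_1, r_2$ be positive real numbers such that
$r_1 / r_2 < \rho_1 / \rho_2$. Let $\theta^0, \bar\theta \in E_k$ be such that $\|\theta^0 - \bar\theta\| < r_1$ and let
$S = \{\theta \in E_k : \|\theta - \bar\theta\| \le r_2\}$. Define a function $\varphi : E_k \to S$ as follows:

$\varphi(\theta) = \theta$ if $\theta \in S$,
$\varphi(\theta) = (1-t)\bar\theta + t\theta$, where $t = r_2 / \|\theta - \bar\theta\|$, if $\theta \notin S$.

Then the following inequality holds for every $\theta \in E_k$:

$\|\theta^0 - \varphi(\theta)\|_B \le \|\theta^0 - \theta\|_B$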


Proof. We may assume $t = r_2 / \|\theta - \bar\theta\| < 1$, for otherwise $\theta \in S$, implying
$\varphi(\theta) = \theta$. By the fundamental theorem of calculus,

$\|\theta^0 - \theta\|_B^2 = \|\theta^0 - \varphi(\theta)\|_B^2 + \int_{s=t}^{s=1} \frac{d}{ds}\left[\|\theta^0 - (1-s)\bar\theta - s\theta\|_B^2\right] ds$

It suffices to prove that the integrand is nonnegative for all $s$, $t \le s \le 1$.
Consider

$\frac{d}{ds}\left[\|\theta^0 - (1-s)\bar\theta - s\theta\|_B^2\right] = 2\langle \bar\theta - \theta, \theta^0 - (1-s)\bar\theta - s\theta\rangle_B$
$= 2s\langle \bar\theta - \theta, \bar\theta - \theta\rangle_B + 2\langle \bar\theta - \theta, \theta^0 - \bar\theta\rangle_B$.

Since $t \le s \le 1$, by the principal axis theorem for real symmetric matrices,
the first term of the above expression is bounded below by $2 r_2 \|\theta - \bar\theta\| \rho_1$.
Similarly, the second term of the above expression is bounded, in absolute
value, by $2 r_1 \|\theta - \bar\theta\| \rho_2$. The result follows from the hypothesis concerning
$r_1$ and $r_2$.
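The inequality of Lemma (3.1) can be checked numerically. The sketch below is illustrative only: the matrix `B`, the radii `r1` and `r2`, the center `center` (playing the role of the rough approximation to the fixed point) and the point `theta0` are hypothetical values chosen so that the radius ratio is below the eigenvalue ratio, and `project` is a name introduced here for the radial truncation onto the ball $S$.

```python
import math
import random

def b_inner(u, B, v):
    """B-inner product <u, v>_B = <u, Bv>."""
    return sum(u[i] * B[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))

def b_norm(u, B):
    return math.sqrt(b_inner(u, B, u))

def project(theta, center, r2):
    """Identity on the ball S; otherwise pull theta radially back onto S."""
    diff = [a - c for a, c in zip(theta, center)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist <= r2:
        return list(theta)
    t = r2 / dist
    return [(1.0 - t) * c + t * a for a, c in zip(theta, center)]

B = [[1.0, 0.0], [0.0, 4.0]]   # eigenvalues 1 and 4, so rho_1 / rho_2 = 0.25
r1, r2 = 0.2, 1.0              # r1 / r2 = 0.2 < 0.25
center = [0.0, 0.0]            # center of the ball S
theta0 = [0.1, -0.1]           # Euclidean distance to center is about 0.141 < r1

# Largest value of ||theta0 - project(theta)||_B - ||theta0 - theta||_B seen.
rng = random.Random(0)
worst_gap = -math.inf
for _ in range(1000):
    theta = [rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)]
    p = project(theta, center, r2)
    after = b_norm([a - b for a, b in zip(theta0, p)], B)
    before = b_norm([a - b for a, b in zip(theta0, theta)], B)
    worst_gap = max(worst_gap, after - before)
```

With these values `worst_gap` is never positive: truncation does not move a point farther from `theta0` in the B-norm, which is what the lemma asserts.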

For the remainder of part 3 we adopt the notation and assumptions stated
previously. Furthermore, we assume that $F$ is locally contractive at $\theta^0$
throughout $S$ and that $S \subset D$. Define a family $\{\tilde Y(\theta) : \theta \in E_k\}$ of random
variables by

(3.2) $\tilde Y(\theta) = Y(\varphi(\theta)) - \varphi(\theta) + \theta$

Then the following inequality is valid for every $\varepsilon > 0$.

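(3.3) Corollary. $\inf_{\varepsilon \le \|\theta - \theta^0\|_B} E(\langle \theta - \theta^0, \tilde Y(\theta)\rangle_B) > 0$

Proof. Since $E(\tilde Y(\theta)) = \varphi(\theta) - F(\varphi(\theta)) - \varphi(\theta) + \theta$, the expression above equals

$\inf_{\varepsilon \le \|\theta - \theta^0\|_B} \{\langle \theta - \theta^0, \theta - \theta^0\rangle_B + \langle \theta - \theta^0, \theta^0 - F(\varphi(\theta))\rangle_B\}$.

The first term above equals $\|\theta - \theta^0\|_B^2$
and the second term is bounded above in absolute value by

$\|\theta - \theta^0\|_B \, \|F(\varphi(\theta)) - \theta^0\|_B$
$\le \lambda \|\theta - \theta^0\|_B \, \|\varphi(\theta) - \theta^0\|_B$ (by (2.1))
$\le \lambda \|\theta - \theta^0\|_B^2$ (by (3.1))

Therefore $\inf_{\varepsilon \le \|\theta - \theta^0\|_B} E(\langle \theta - \theta^0, \tilde Y(\theta)\rangle_B) \ge (1 - \lambda)\varepsilon^2 > 0$.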
(3.4) Definition. A gain sequence is a sequence $\{a_\ell\}$ of positive numbers
satisfying $\sum_{\ell=1}^{\infty} a_\ell = \infty$ and $\sum_{\ell=1}^{\infty} a_\ell^2 < \infty$.

(3.5) Remark. For any $c > 0$, $\{a_\ell = c/\ell\}$ is a gain sequence since

$\lim_{K \to \infty} \sum_{\ell=1}^{K} \frac{c}{\ell} \ge \lim_{K \to \infty} c \log K = \infty$ and $\sum_{\ell=1}^{\infty} a_\ell^2 = \frac{c^2 \pi^2}{6}$.
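The two properties in Remark (3.5) are easy to observe on partial sums. The following short Python check is only a sketch; the constant `c` and the cutoff `K` are arbitrary illustrative choices.

```python
import math

def gain(c, l):
    """Gain a_l = c / l for l = 1, 2, ..."""
    return c / l

c = 2.0
K = 100000

# Partial sum of the gains: grows at least like c * log K, hence without bound.
partial_sum = sum(gain(c, l) for l in range(1, K + 1))

# Partial sum of the squared gains: approaches c^2 * pi^2 / 6 from below.
partial_sq = sum(gain(c, l) ** 2 for l in range(1, K + 1))
limit_sq = c * c * math.pi ** 2 / 6.0
```

The remaining tail of the squared sum beyond `K` is of order `c * c / K`, so `partial_sq` already agrees with `limit_sq` to several decimal places.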


(3.6) Theorem. Let $\{\tilde Y(\theta) : \theta \in E_k\}$ be as in (3.2) and let $\{a_\ell\}$ be a gain
sequence. Then the following sequence of recursively defined random vectors

(3.7) $\theta_{\ell+1} = \theta_\ell - a_\ell \tilde Y(\theta_\ell)$, $\theta_1$ arbitrarily chosen,

converges in mean square and with probability one to $\theta^0$.
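A toy simulation may help in reading the recursion (3.7). The sketch below is not taken from the paper: it assumes a hypothetical regression function with $F(\theta) = \theta^0 + 0.5(\theta - \theta^0)$, Gaussian observation noise, $B$ equal to the identity, and illustrative values for the fixed point `THETA0`, the ball center `CENTER` and the radius `R2`. The observations are extended outside the ball as in (3.2) before each step.

```python
import math
import random

THETA0 = [1.0, -1.0]    # fixed point of F (hypothetical)
CENTER = [0.8, -0.7]    # rough initial approximation, center of the ball S
R2 = 2.0                # radius of S

def project(theta):
    """Radial truncation onto the ball S of radius R2 about CENTER."""
    diff = [a - c for a, c in zip(theta, CENTER)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist <= R2:
        return list(theta)
    t = R2 / dist
    return [(1.0 - t) * c + t * a for a, c in zip(theta, CENTER)]

def sample_y(theta, rng):
    """Noisy observation with mean theta - F(theta) = 0.5 * (theta - THETA0)."""
    return [0.5 * (t - t0) + rng.gauss(0.0, 0.1) for t, t0 in zip(theta, THETA0)]

def sample_y_extended(theta, rng):
    """Extension (3.2): observe at the truncated point, then add theta - phi(theta)."""
    p = project(theta)
    y = sample_y(p, rng)
    return [yi - pi + ti for yi, pi, ti in zip(y, p, theta)]

def run(theta1, steps, c=2.0, seed=0):
    """Recursion (3.7) with the gain sequence a_l = c / l."""
    rng = random.Random(seed)
    theta = list(theta1)
    for l in range(1, steps + 1):
        a = c / l
        y = sample_y_extended(theta, rng)
        theta = [t - a * yi for t, yi in zip(theta, y)]
    return theta

estimate = run([10.0, 10.0], 5000)
error = math.sqrt(sum((e - t0) ** 2 for e, t0 in zip(estimate, THETA0)))
```

Although the starting point lies far outside the ball, the extended observations pull the iterates back, and `error` ends up small compared with the noise level of a single observation.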


Proof. We refer the reader to the algorithm described in [1, pp. 332-333] and
the convergence proof given in the appendix to [1, pp. 350-352]. Replacing
their gain sequence $\{\rho_\ell\}$ with the gain sequence $\{a_\ell\}$, and replacing their
norm $\|\cdot\|$ and inner product $\langle V, W\rangle = V'W$ (where $V'$ denotes the transpose
of the vector $V$ and $V'W$ denotes matrix multiplication) with the B-norm
and B-inner product respectively, the theorem will follow once we verify that
conditions (A1) - (A3) in [1, pp. 332-333] are satisfied.

(A1) Since $E(\tilde Y(\theta)) = \theta - F(\varphi(\theta))$, this result follows from (3.3) since $B$
positive definite implies $\|\theta - \theta^0\|_B = 0$ if and only if $\theta = \theta^0$.

(A2) Follows from (3.3).

(A3) $E(\|\tilde Y(\theta)\|_B^2) = E(\langle Y(\varphi(\theta)) - \varphi(\theta) + \theta, Y(\varphi(\theta)) - \varphi(\theta) + \theta\rangle_B)$
$\le h(1 + \|\theta - \theta^0\|_B^2)$ for sufficiently large $h > 0$ since $S \subset D$ and by (2.3)
$\sup_{\theta \in D} E(\|Y(\theta)\|_B^2) < \infty$.
4. Application to Maximum Likelihood Estimation

Let $D = \{(\alpha_i, \mu_i, \Sigma_i)_{i=1,\dots,m}\}$ where each $\alpha_i > 0$ with $\sum_{i=1}^{m} \alpha_i = 1$,
each $\mu_i \in R^n$ and each $\Sigma_i$ is a positive definite real symmetric $n \times n$ matrix.
We consider $D$ to be realized as an open subset of $E_k$ where
$k = m(n+1)(n+2)/2$. For $\theta = (\alpha_i, \mu_i, \Sigma_i)_{i=1,\dots,m}$ let $x(\theta)$ be a random
variable with values in $R^n$ and with density function

$p(\theta, x) = \sum_{i=1}^{m} \alpha_i p_i(\theta, x)$ for $x \in R^n$

where

$p_i(\theta, x) = (2\pi)^{-n/2} |\Sigma_i|^{-1/2} \exp\{-\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\}$

for each $i = 1, \dots, m$.
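The mixture density can be evaluated directly from its definition. The sketch below is one possible plain-Python implementation, written for illustration only; the function names are introduced here, and a Cholesky factorization is used (an implementation choice, not something prescribed by the text) to obtain both the determinant and the quadratic form of each component.

```python
import math

def cholesky(S):
    """Lower-triangular L with L L^T = S, for symmetric positive definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def normal_density(x, mu, S):
    """Component density (2 pi)^(-n/2) |S|^(-1/2) exp(-(x-mu)^T S^(-1) (x-mu) / 2)."""
    n = len(x)
    L = cholesky(S)
    # Forward substitution solves L z = x - mu; then z.z is the quadratic form.
    z = []
    for i in range(n):
        z.append((x[i] - mu[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    quad = sum(v * v for v in z)
    return math.exp(-0.5 * (n * math.log(2.0 * math.pi) + log_det + quad))

def mixture_density(x, alphas, mus, sigmas):
    """Mixture density: sum over components of alpha_i times the i-th normal density."""
    return sum(a * normal_density(x, m, S) for a, m, S in zip(alphas, mus, sigmas))

# Example with two bivariate components (illustrative parameter values).
value = mixture_density(
    [0.0, 0.0],
    [0.3, 0.7],
    [[0.0, 0.0], [1.0, 1.0]],
    [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 1.0], [1.0, 2.0]]],
)
```

Working with the logarithm of the determinant avoids forming the inverse of each covariance matrix explicitly.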

Fix $\theta^0 \in D$ and let $\{x_k\}_{k=1,\dots,N}$ be an independent sample of observations
of $x(\theta^0)$. A maximum-likelihood estimate of $\theta^0$ based on $\{x_k\}$ is a choice
of $\theta \in D$ which locally maximizes the log-likelihood function

$L = \sum_{k=1}^{N} \log p(\theta, x_k)$

In the appendix to [3], Peters and Walker prove there exists a sufficiently
small neighborhood of $\theta^0$ such that, with probability $\to 1$ as $N \to \infty$, there
exists a unique maximum-likelihood estimate of $\theta^0$ in that neighborhood.
Furthermore, with probability $\to 1$ as $N \to \infty$, this estimate $\to \theta^0$. This
estimate will be called the consistent maximum-likelihood estimate (which we
abbreviate by c.m.l.e.).
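Equating $\nabla_\theta L = 0$ and performing algebraic manipulation of the resulting
equations yields the following necessary condition for a m.l.e. $\hat\theta$ of $\theta^0$:
$\hat\theta = \Phi(\hat\theta)$, where $\Phi : D \to D$ is defined by

$\Phi(\theta) = (\alpha_i', \mu_i', \Sigma_i')_{i=1,\dots,m}$

where, for each $i = 1, \dots, m$,

(4.1) $\alpha_i' = \frac{\alpha_i}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)}$

(4.2) $\mu_i' = \left\{\frac{1}{N} \sum_{k=1}^{N} x_k \frac{p_i(x_k)}{p(x_k)}\right\} \Big/ \left\{\frac{1}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)}\right\}$

(4.3) $\Sigma_i' = \left\{\frac{1}{N} \sum_{k=1}^{N} (x_k - \mu_i)(x_k - \mu_i)^T \frac{p_i(x_k)}{p(x_k)}\right\} \Big/ \left\{\frac{1}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)}\right\}$

where each $p_i$ and $p$ is evaluated with respect to the parameters
$\theta = (\alpha_i, \mu_i, \Sigma_i)_{i=1,\dots,m}$. $\Phi$ will be called the likelihood function.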
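One application of the map defined by (4.1) - (4.3) can be sketched in Python. To keep the example short it is specialized to $n = 1$ (scalar observations and variances), which is a simplification made here and not in the text. It also assumes that the covariance update (4.3) is centered at the current mean rather than the updated one; the sample values and the function name `likelihood_map` are illustrative.

```python
import math

def normal_pdf(x, mu, var):
    """Univariate normal density with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def likelihood_map(theta, sample):
    """Apply (4.1)-(4.3) once for n = 1; theta = (alphas, mus, variances)."""
    alphas, mus, variances = theta
    N = len(sample)
    # Mixture density p(x_k) at every observation, using the current parameters.
    p = [sum(a * normal_pdf(x, m, v) for a, m, v in zip(alphas, mus, variances))
         for x in sample]
    new_alphas, new_mus, new_vars = [], [], []
    for a, m, v in zip(alphas, mus, variances):
        ratios = [normal_pdf(x, m, v) / pk for x, pk in zip(sample, p)]
        mean_ratio = sum(ratios) / N
        new_alphas.append(a * mean_ratio)                                   # (4.1)
        new_mus.append(sum(x * r for x, r in zip(sample, ratios)) / N / mean_ratio)  # (4.2)
        # Assumption: squared deviations are taken about the current mean m.
        new_vars.append(sum((x - m) ** 2 * r for x, r in zip(sample, ratios))
                        / N / mean_ratio)                                   # (4.3)
    return new_alphas, new_mus, new_vars

sample = [-2.1, -1.9, -2.4, 2.0, 2.2, 1.8]       # illustrative data
theta = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])    # illustrative starting parameters
updated = likelihood_map(theta, sample)
```

Because the weighted ratios sum to one at every observation, the updated mixing proportions again sum to one, and the updated variances remain positive.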

In [3], Peters and Walker develop an iterative procedure which, starting
with any initial estimate $\theta'$ which is sufficiently close to $\theta^0$, yields a
sequence in $D$ converging to the c.m.l.e. of $\theta^0$ based on $\{x_k\}_{k=1,\dots,N}$.
Their technique consists in proving that, for $\varepsilon$ smaller than a bound given in [3],
the function

(4.4) $\Phi_\varepsilon(\theta) = (1 - \varepsilon)\theta + \varepsilon \Phi(\theta)$

is locally contractive (at the c.m.l.e. of $\theta^0$) throughout a neighborhood of
$\theta^0$. Thus, for any $\theta'$ in this neighborhood of $\theta^0$, the sequence defined
recursively by

(4.5) $\theta_{i+1} = \Phi_\varepsilon(\theta_i)$, $\theta_1 = \theta'$

converges to the c.m.l.e. of $\theta^0$.
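The effect of the step size in (4.4) and (4.5) can be seen on a scalar toy map. The sketch below does not use the likelihood function: `toy_map` is a hypothetical map with fixed point 1 and derivative -2 there, chosen only because plain iteration of it diverges while the relaxed iteration with step size 0.5 converges.

```python
def relaxed_iteration(phi, theta, eps, steps):
    """Iterate theta <- (1 - eps) * theta + eps * phi(theta), as in (4.4)-(4.5)."""
    for _ in range(steps):
        theta = (1.0 - eps) * theta + eps * phi(theta)
    return theta

def toy_map(theta):
    """Hypothetical map with fixed point 1 and derivative -2 at that point."""
    return 3.0 - 2.0 * theta

# With eps = 0.5 the relaxed map has derivative 0.5 + 0.5 * (-2) = -0.5.
relaxed = relaxed_iteration(toy_map, 0.0, 0.5, 60)

# With eps = 1 (plain iteration) the error doubles in size at every step.
plain = relaxed_iteration(toy_map, 0.0, 1.0, 20)
```

The derivative of the toy map has real part below one but modulus above one, so relaxation with a suitable step size is exactly what turns it into a contraction near the fixed point.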

In concluding, they discuss the computational advantages of this procedure
as compared to classical numerical techniques such as Newton's method or the
method of scoring. In particular, the procedure satisfies the following
conditions.

(4.6) At each stage of the iteration in (4.5), the constraints on the
parameters in $\theta$ are satisfied.

(4.7) The 'step size' $\varepsilon$ depends only on $n$ and $m$ and not on $\theta^0$.

(4.8) The procedure does not require the inversion, at each stage of the
iteration, of a $k \times k$ matrix.

We will present a stochastic approximation analog of the iterative procedure
defined by (4.5). In contrast to the classical stochastic method of scoring,
our procedure satisfies the conditions (4.6) and (4.7). A step size $\varepsilon$
will not appear explicitly in our algorithm.

Fix $\theta^0 \in D$ and let $\{x_k\}_{k=1}^{\infty}$ be an infinite sequence of independent
samples of observations of $x(\theta^0)$. For any function $g$ from $R^n$ to any real
vector space $V$, let

$E(g) = \int_{R^n} g(x)\, p(\theta^0, x)\, dx$

be the expectation, if it exists, of $g \circ x(\theta^0)$ ($\circ$ denotes composition of
functions). Then, by the strong law of large numbers, with probability one
as $N \to \infty$ the equations (4.1) - (4.3) converge to

(4.9) $\alpha_i' = \alpha_i E\left(\frac{p_i}{p}\right)$

(4.10) $\mu_i' = E\left(x \frac{p_i}{p}\right) \Big/ E\left(\frac{p_i}{p}\right)$

(4.11) $\Sigma_i' = E\left((x - \mu_i)(x - \mu_i)^T \frac{p_i}{p}\right) \Big/ E\left(\frac{p_i}{p}\right)$

We denote the corresponding limiting value of the likelihood function by $\bar\Phi$.
Clearly $\bar\Phi(\theta)$ is a continuously differentiable function of $\theta$ and $\bar\Phi(\theta^0) = \theta^0$;
also by (4.4), $\bar\Phi_\varepsilon(\theta) = (1 - \varepsilon)\theta + \varepsilon \bar\Phi(\theta)$ is locally contractive at $\theta^0$. By
(2.2), $\rho(\nabla \bar\Phi_\varepsilon(\theta^0)) < 1$, implying that the eigenvalues of $\nabla \bar\Phi(\theta^0)$ have real
parts strictly less than 1. Now define a function $d : D \to D$ by

$d(\theta) = (\alpha_i, \alpha_i \mu_i, \alpha_i \Sigma_i)_{i=1,\dots,m}$

Clearly $d$ is a differentiable function from $D \to D$ such that $d^{-1}$ exists
and is differentiable. Define a function $\Phi' : D \to D$ by $\Phi'(\theta) = d \circ \bar\Phi \circ d^{-1}(\theta)$.
By the chain rule for Frechet derivatives,

$\nabla \Phi'(d(\theta^0)) = [\nabla d(\theta^0)][\nabla \bar\Phi(\theta^0)][\nabla d(\theta^0)]^{-1}$

hence the function

(4.13) $\Phi'_\varepsilon(\theta) = (1 - \varepsilon)\theta + \varepsilon \Phi'(\theta)$

is locally contractive throughout some neighborhood $W$ of $d(\theta^0)$.

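Define a family $\{Y(\theta) : \theta \in D\}$ of random variables with values in $E_k$ by

(4.14) $Y(\theta) = \varepsilon\left(\alpha_i - \alpha_i \frac{p_i'}{p'},\ \mu_i - \alpha_i x \frac{p_i'}{p'},\ \Sigma_i - \alpha_i \left(x - \frac{\mu_i}{\alpha_i}\right)\left(x - \frac{\mu_i}{\alpha_i}\right)^T \frac{p_i'}{p'}\right)_{i=1,\dots,m}$

where $\theta = (\alpha_i, \mu_i, \Sigma_i)_{i=1,\dots,m}$, $x = x(\theta^0)$, $p_i' = p_i(d^{-1}(\theta), x(\theta^0))$ and $p' = p(d^{-1}(\theta), x(\theta^0))$.

Then $E(Y(\theta)) = \theta - \Phi'_\varepsilon(\theta)$. Therefore, the family $\{Y(\theta) : \theta \in D\}$ satisfies
conditions (2.3) and (2.4).

Let $\{\tilde Y(\theta) : \theta \in E_k\}$ be constructed from $\{Y(\theta) : \theta \in D\}$ as in (3.2) and
let $\{a_\ell\}$ be a gain sequence. Then by (3.6), the sequence in (3.7)
converges in mean square and with probability one to $d(\theta^0)$. Since $\{\varepsilon a_\ell\}$ is
a gain sequence whenever $\{a_\ell\}$ is a gain sequence, $\varepsilon$ need not appear explicitly
in the sequence in (3.7).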
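The resulting stochastic procedure can be sketched for a univariate mixture of two components. This is an illustration under stated assumptions rather than the algorithm as specified: it works in the transformed coordinates (proportion, proportion times mean, proportion times variance), takes the observation in (4.14) to have those three blocks with the variance block centered at the current mean, absorbs the step size into the gain, uses the gain $1/(\ell + 10)$ (a gain sequence in the sense of (3.4), chosen here to damp the first steps), and omits the truncation of (3.2) on the assumption that the iterates stay in the region of interest. All numerical values are hypothetical.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Univariate normal density with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def stochastic_step(state, x, a):
    """One step theta <- theta - a * Y(theta) in the transformed coordinates."""
    alphas, ms, ss = state
    mus = [m / al for m, al in zip(ms, alphas)]          # recover the means
    variances = [s / al for s, al in zip(ss, alphas)]    # recover the variances
    dens = [normal_pdf(x, mu, v) for mu, v in zip(mus, variances)]
    p = sum(al * d for al, d in zip(alphas, dens))
    w = [al * d / p for al, d in zip(alphas, dens)]      # alpha_i * p_i / p
    new_alphas = [al - a * (al - wi) for al, wi in zip(alphas, w)]
    new_ms = [m - a * (m - wi * x) for m, wi in zip(ms, w)]
    new_ss = [s - a * (s - wi * (x - mu) ** 2) for s, wi, mu in zip(ss, w, mus)]
    return new_alphas, new_ms, new_ss

def run(steps=20000, seed=0):
    rng = random.Random(seed)
    true_alphas, true_mus = [0.5, 0.5], [-3.0, 3.0]      # both variances equal 1
    alphas, mus, variances = [0.4, 0.6], [-2.0, 2.0], [2.0, 2.0]   # initial guess
    state = (alphas,
             [al * mu for al, mu in zip(alphas, mus)],
             [al * v for al, v in zip(alphas, variances)])
    for l in range(1, steps + 1):
        i = 0 if rng.random() < true_alphas[0] else 1
        x = rng.gauss(true_mus[i], 1.0)                  # one new observation per step
        state = stochastic_step(state, x, 1.0 / (l + 10))
    alphas, ms, ss = state
    return (alphas,
            [m / al for m, al in zip(ms, alphas)],
            [s / al for s, al in zip(ss, alphas)])

est_alphas, est_mus, est_vars = run()
```

Since every gain is at most one, each step is a convex combination of the current point and a point satisfying the parameter constraints, so the proportions stay positive and sum to one and the variances stay positive, in line with condition (4.6).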



